Add bounded timeouts to BlobStore shutdown join() calls by joewiz · Pull Request #6183 · eXist-db/exist

joewiz · 2026-03-26T01:57:18Z

Summary

BlobStoreImpl.normalClose() and abnormalPersistentWriterShutdown() called Thread.join() with no timeout on the PersistentWriter and BlobVacuum threads. If either thread failed to terminate promptly, the shutdown would hang indefinitely — causing CI test suite timeouts when the surefire fork couldn't exit.

This supersedes #6179, which incorrectly made BlobStore threads daemon threads. As @adamretter (author of BlobStore) correctly pointed out, these threads must remain non-daemon to ensure pending blob writes complete. The fix is bounded join() timeouts during shutdown — giving threads 30s to drain before proceeding.

What changed

BlobStoreImpl.java — Added 30-second bounded timeouts to all three Thread.join() calls:

normalClose() — PersistentWriter: join(30_000), then interrupt() + join(5_000) if still alive
normalClose() — BlobVacuum: join(30_000) with warning if still alive
abnormalPersistentWriterShutdown() — BlobVacuum: join(30_000) with warning if still alive

Why this is safe

The existing shutdown mechanisms are correct and unchanged:

PersistentWriter receives a POISON_PILL via the persist queue → drains remaining items and exits
BlobVacuum receives interrupt() → wakes from PriorityBlockingQueue.take() and exits

In practice, both threads respond within milliseconds. The 30s timeout is a safety net — it only fires if something goes wrong, ensuring shutdown completes instead of hanging forever.

Test results

27/27 BlobStore-specific tests pass
3/3 full test suite trials: BUILD SUCCESS, clean exit (3:26–4:23)

Test plan

BlobStoreImplTest — 27 tests, 0 failures
Full mvn test -pl exist-core — 3 trials, all pass, no hangs
CI passes on all platforms

🤖 Generated with Claude Code

@adamretter

BlobStoreImpl.normalClose() and abnormalPersistentWriterShutdown() called Thread.join() with no timeout on the PersistentWriter and BlobVacuum threads. If either thread failed to terminate promptly, the shutdown would hang indefinitely — causing CI test suite timeouts. Add 30-second bounded timeouts to all join() calls. If the PersistentWriter doesn't terminate within 30s, it is interrupted and given 5 more seconds. The threads remain non-daemon to ensure pending blob writes complete during normal shutdown. This supersedes eXist-db#6179, which incorrectly made BlobStore threads daemon threads. As @adamretter (author of BlobStore) pointed out, these threads must remain non-daemon to ensure pending blob writes complete. The fix is bounded join() timeouts during shutdown — giving threads 30s to drain before proceeding. Verified: 3/3 full test suite trials pass with clean exit (3:26-4:23), 27/27 BlobStore-specific tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

adamretter · 2026-03-26T04:31:36Z

@joewiz Unfortunately, in practice this has the same corruption issue as your previous approach in #6179, it just goes abount achieving that corruption in a different manner.

joewiz · 2026-03-26T13:39:06Z

[This response was co-authored with Claude Code. -Joe]

Closing this PR. Adam is right that the bounded timeout has the same fundamental risk as the daemon thread approach — if the timeout fires and closeBlobStore() proceeds while the writer is mid-write, data corruption is possible.

The CI hang symptom that motivated this fix is now addressed through other means:

Fix BrokerPool shutdown hang causing CI timeouts #6167 (merged) and Fix ExecutorService thread leak in TransactionTestDSL #6175 (merged) fixed the dominant shutdown hangs — develop baseline is clean in our controlled experiment (5/5 trials, 0 hangs)
[ci] Add surefire fork timeouts to prevent CI hangs #6186 adds surefire fork timeouts as a safety net, ensuring CI never burns 45+ minutes on a hung fork

The underlying BlobStore shutdown behavior (non-daemon threads that can prevent JVM exit if they hang) remains as-is. It's the correct design for data integrity — these threads must complete their work before the JVM exits.

joewiz requested a review from a team as a code owner March 26, 2026 01:57

joewiz mentioned this pull request Mar 26, 2026

[ci] Add surefire fork timeouts to prevent CI hangs #6186

Open

3 tasks

joewiz marked this pull request as draft March 26, 2026 13:34

joewiz closed this Mar 26, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add bounded timeouts to BlobStore shutdown join() calls#6183

Add bounded timeouts to BlobStore shutdown join() calls#6183
joewiz wants to merge 1 commit intoeXist-db:developfrom
joewiz:bugfix/blobstore-shutdown-timeout

joewiz commented Mar 26, 2026

Uh oh!

adamretter commented Mar 26, 2026

Uh oh!

joewiz commented Mar 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

joewiz commented Mar 26, 2026

Summary

What changed

Why this is safe

Test results

Test plan

Uh oh!

adamretter commented Mar 26, 2026

Uh oh!

joewiz commented Mar 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants